Looking for datasets to explore, I came across a dataset on Kaggle listing UFO sightings over the last century. I performed some basic cleaning, EDA and visualization on it. Here goes my attempt to present it!
Importing the basic stuff and creating the dataframe
import pandas as pd
import numpy as np
import plotly.offline as pyo
df = pd.read_csv(r"../data/complete.csv",error_bad_lines=False,warn_bad_lines=False,low_memory=False) #note: in pandas >= 1.3, error_bad_lines/warn_bad_lines are replaced by on_bad_lines='skip'
Let us take a look at the dataframe columns:
df.columns
11 variables, of which all but latitude and longitude are categorical; even lat/long are read in as strings, so they have no numerical significance yet.
Taking a look at 5 samples from the dataframe.
df.sample(5)
All the columns are pretty simple to understand. Let's remove the 'duration (hours/min)' column: it holds vague values (e.g. '2 nights', '1-2 hours') and the seconds column carries the same information.
df.drop(labels='duration (hours/min)',axis=1,inplace=True)
Taking a count of null values:
#let's take a count
df.isnull().sum()
State and Country have the maximum null values:
Let us try to fill these in with the following method. First, let's look at 'country' for rows where 'state' == 'tx':
df[df['state']=='tx']['country']
Let's check the number of null values here:
df[df['state']=='tx']['country'].isnull().sum()
The mode here is very clearly 'us', yet we have 299 null values. So we can replace the null values with 'us'.
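That single replacement can be sketched as follows (with a toy frame standing in for the UFO data; the values are illustrative, not from the dataset):

```python
import pandas as pd

# toy frame standing in for the UFO data (illustrative values)
toy = pd.DataFrame({'state': ['tx', 'tx', 'tx', 'ca'],
                    'country': ['us', None, 'us', None]})

# fill missing 'country' for Texas rows with the mode of that subset ('us')
mask = toy['state'] == 'tx'
toy.loc[mask, 'country'] = toy.loc[mask, 'country'].fillna(
    toy.loc[mask, 'country'].mode().iloc[0])
```

Note that `mode()` ignores NaN, so it picks out 'us' from the non-null Texas rows, and rows for other states are left untouched.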
Why not extend this principle and:
Iterate through each unique city and fill up empty country values
Iterate through each city and fill up empty state values
Iterate through each state and fill up empty country values
Let's check the number of unique cities:
df['city'].describe()
#there are ~22k unique cities and these can be used to fill many NaN country/state values
city_list=list(df['city'].unique()) #Creating a list of all unique cities:
Fill empty __ using __; for example, fill empty countries using city.
#let's define a function to perform the tasks mentioned above - fill empty __ using __
def fill_missing(fill_missing,fill_using): #both string column names - eg: fill_missing = 'country', fill_using = 'city'
    unique_list = list(df[fill_using].unique()) #eg: list of unique cities
    for elem in unique_list: #for each city in the list
        corresponding_missing_series = df[df[fill_using]==elem][fill_missing].mode() #eg: corresponding country in series form
        if len(corresponding_missing_series) != 0: #if a mode exists
            corresponding_missing = corresponding_missing_series.iloc[0]
            df.loc[df[fill_using]==elem,fill_missing] = df.loc[df[fill_using]==elem,fill_missing].fillna(corresponding_missing) #replace all empty values with the mode
Filling up empty countries using city:
fill_missing('country','city')
Since the function above might not be very clear about what it is doing, here is what the same task looks like written out without the function:
Filling up empty countries using city:
# #now iterating for each city
# for city in city_list: # city list is a list of all unique cities
# corresponding_country_series=df[df['city']==city]['country'].mode() #finding out the country which corresponds to that city
# if len(corresponding_country_series) != 0: #if a mode exists
# corresponding_country=corresponding_country_series.iloc[0]
# df.loc[df['city']==city,'country'] = df.loc[df['city']==city,'country'].fillna(corresponding_country) #replace all empty values with mode
Let's see how many null values we filled:
df['country'].isnull().sum()
Managed to fill ~1k empty fields in country.
Filling up empty states using city
fill_missing('state','city')
What the function actually did:
# #we already have city_list
# for city in city_list:
# corresponding_state_series=df[df['city']==city]['state'].mode() #finding out the state which corresponds to that city
# if len(corresponding_state_series) != 0:
# corresponding_state=corresponding_state_series.iloc[0]
# df.loc[df['city']==city,'state'] = df.loc[df['city']==city,'state'].fillna(corresponding_state)
Let's see how many empty states we filled up:
df['state'].isnull().sum()
Only ~500 empty states filled.
Let's try to fill some more country fields by repeating the same code with states:
Filling up empty country using states
fill_missing('country','state')
Again, illustrating what our function did in the form of code just for clarity:
state_list=list(df['state'].unique())
#getting a list of all states
Filling up empty country using states
# #now iterating for each state
# for state in state_list:
# corresponding_country_series=df[df['state']==state]['country'].mode() #finding out the country which corresponds to that city
# if len(corresponding_country_series) != 0:
# corresponding_country=corresponding_country_series.iloc[0]
# df.loc[df['state']==state,'country'] = df.loc[df['state']==state,'country'].fillna(corresponding_country)
df['country'].isnull().sum()
~7k null values filled up in country!
Now checking empty df values:
df.isnull().sum()
Let's remove rows with empty countries and empty states now.
df.dropna(subset=['country'],inplace=True)
df.dropna(subset=['state'],inplace=True)
df.isnull().sum()
Finally, we parse the datetime and date_posted columns to datetime64 format.
df.dropna(subset=['duration (seconds)'],inplace=True) #dropping the one empty row remaining
#let's parse the datetime column into datetime64 format
df['datetime'] = df['datetime'].str.replace("24:00","00:00") #there were rows with time as 24:00 - changing to 00:00
df['datetime_parsed'] = pd.to_datetime(df['datetime'],format ="%m/%d/%Y %H:%M" )
#similarly parsing date posted
df['date_posted_parsed'] = pd.to_datetime(df['date posted'],infer_datetime_format=True)
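To see why the 24:00 replacement above is needed: '24:00' is not a valid %H:%M time, so `to_datetime` would raise on it unless we map it to '00:00' first. A minimal sketch with illustrative strings:

```python
import pandas as pd

# '24:00' is not a valid %H:%M time, so we map it to '00:00' before parsing
raw = pd.Series(['10/10/1949 24:00', '10/10/1949 20:30'])  # illustrative strings
parsed = pd.to_datetime(raw.str.replace('24:00', '00:00', regex=False),
                        format='%m/%d/%Y %H:%M')
```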
Converting 'duration (seconds)' to float32 type:
df['duration (seconds)'] = df['duration (seconds)'].str.replace("`","") #many values had a stray ` attached to them
df['duration (seconds)'] = df['duration (seconds)'].astype(dtype='float32')
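A slightly more defensive version of this conversion uses pd.to_numeric with errors='coerce', which turns any remaining malformed strings into NaN instead of raising. A sketch with toy values mimicking the stray backticks in the raw file:

```python
import pandas as pd

# toy 'duration (seconds)' values, including the stray backtick seen in the raw file
s = pd.Series(['2`', '30', '600'])
cleaned = pd.to_numeric(s.str.replace('`', '', regex=False),
                        errors='coerce').astype('float32')
```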
Data cleaning done!
1. Which geographic regions (city/state/country-wise) have the maximum ufo sightings?
2. Is there any truth to the phrase that aliens only want to invade the US (:P)?
3. How can you utilize the 'comments' column to get some/any meaningful insight?
4. Insights, if any from the shape of sightings?
1. Which city has the maximum ufo sightings?
2. Segregate the country according to the sightings count
3. Out of the cities where sightings have occurred, where have the minimum sightings occurred? (say, for example, only 1)
4. Show the cities with only higher volume of sightings (more than 50)
5. Out of the cities of sightings, how many are in the USA?
6. How many distinct shapes of UFOs have been sighted?
7. Most common shape of ufo sighted?
8. Most unusual shape of ufo sighted?
9. Average time duration for which UFOs are seen?
10. Group shapes by their mean duration
xx----xx
1. Which city has the maximum UFO sightings?
Top 20 cities according to sightings frequency:
(Seattle tops the list)
Interestingly, all 20 are US cities.
df['city'].value_counts()[:20]
2. Segregate the countries according to the sightings count
Only 4 countries are represented; the US does top the list.
df['country'].value_counts()
3. Cities with the least volume of sightings - only 1
Out of ~17k unique cities, ~10k have only 1 sighting, i.e. more than 50 percent.
df_city_count = df['city'].value_counts().reset_index().rename(columns={'index':'city','city':'count'}) # df with count of cities
df_city_count[df_city_count['count']==1][['city','count']]
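One version caveat: in pandas 2.0+, `value_counts().reset_index()` already returns columns named after the series ('city') and 'count', so the rename above may behave differently depending on your pandas version. A version-independent way to count singleton cities, sketched on toy data:

```python
import pandas as pd

# toy city column (illustrative)
cities = pd.Series(['seattle', 'seattle', 'phoenix', 'york', 'york', 'dover'])
counts = cities.value_counts()          # sightings per city
singletons = counts[counts == 1]        # cities with exactly one sighting
```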
4. Cities with a high volume of sightings (50 or more)
Only 225 cities have 50 or more sightings, i.e. less than 2 percent.
df_city_count[df_city_count['count']>=50][['city','count']]
5. Out of the cities of sightings, how many are in the USA?
Out of ~17k cities, ~16k are in the US, an overwhelming majority.
df[df['country']=='us']['city'].value_counts()
6. How many distinct shapes of UFOs have been sighted?
29! Not that many.
df['shape'].describe()
7. Most common shape of ufo sighted?
Light (?)
8. Most unusual shape of ufo sighted?
Pyramid, Flare, Changed(?), Hexagon, Dome.
df['shape'].value_counts()
9. Average time duration for which UFOs are seen?
10. Group shapes by their mean duration
The US has almost all shape types among its sightings; red predominates the plot.
For most shapes, the mean duration is under 1000 minutes.
Canada has an overwhelmingly large mean value for the 'other' shape of UFO, probably skewed by some large values (if we remove the largest duration value in Canada for 'other', 1380000 minutes, the mean is only 287).
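The outlier effect on such a mean can be sketched with hypothetical numbers (not the actual Canadian values, just the same shape of problem):

```python
import pandas as pd

# hypothetical durations (minutes) with one huge outlier dominating the mean
dur = pd.Series([10.0, 20.0, 30.0, 1_380_000.0])
mean_with_outlier = dur.mean()                 # dominated by the outlier
mean_without = dur.drop(dur.idxmax()).mean()   # mean of the remaining values
```

Dropping a single extreme value brings the mean back to the scale of the typical observations, which is why a trimmed mean (or the median) is often a better summary for such skewed durations.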
#plotting another column = converting seconds to minutes
df['duration (minutes)'] = df['duration (seconds)']/60
#pivot table
pd.pivot_table(df,index=['shape'],columns=['country'],values=['duration (minutes)'],aggfunc='mean').plot.bar(figsize=(20, 10))
Without country-wise segregation:
Crescent shaped UFO seen for highest mean duration and dome shaped UFO for least mean duration.
df[['shape','duration (minutes)']].groupby('shape').mean().sort_values(by='duration (minutes)', ascending = False)
import plotly.express as px
import plotly.graph_objects as go
Let's see the distribution of duration for which UFOs are sighted as a pie chart:
Clearly, the time distribution is overwhelmingly in the US and Canada, with sightings for extremely small durations in Britain and Australia.
px.pie(df,names ='country',values='duration (minutes)',title='Duration(mean) of UFO sightings by country:')
Let's see if the distribution of number of sightings is also similar.
As seen below, the number of sightings is even more one-sided, with ~95 percent of all sightings in the US.
df_country_count = df['country'].value_counts().reset_index().rename(columns={'index':'country','country':'value'})
df_country_count
px.pie(df_country_count,names='country',values='value',title='Number of UFO sightings by country:')
--XX--
Let's visualize the top 20 cities in terms of mean duration as a bar chart.
The text on top shows the country. Only one city in the top 20 is from Canada.
df_mean_duration = df[['country','city','duration (minutes)']].groupby(['city','country']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_mean_duration[:20],x='city',y='duration (minutes)',text='country',title='Top 20 cities for duration(mean) of UFO sightings:')
Let's check the mean duration separately for the top 20 cities of the other 3 countries to see how they compare:
Starting off with Canada
df_ca=df[df['country']=='ca']
df_ca_mean_duration = df_ca[['city','duration (minutes)']].groupby(['city']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_ca_mean_duration[:20],x='city',y='duration (minutes)',text='duration (minutes)',title='Top 20 Canadian cities for duration(mean) of UFO sightings')
To Australia
df_au=df[df['country']=='au']
df_au_mean_duration = df_au[['city','duration (minutes)']].groupby(['city']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_au_mean_duration[:20],x='city',y='duration (minutes)',text = 'duration (minutes)',title='Top 20 Australian cities for duration(mean) of UFO sightings')
To Great Britain:
df_gb=df[df['country']=='gb']
df_gb_mean_duration = df_gb[['city','duration (minutes)']].groupby(['city']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
px.bar(df_gb_mean_duration[:20],x='city',y='duration (minutes)',text = 'duration (minutes)',title='Top 20 British cities for duration(mean) of UFO sightings')
From the above 4 graphs, we can quickly figure out that:
While Great Britain and Australia are on a similar scale for duration of sightings,
UFOs in Canada are sighted for around 100x longer than in Great Britain and Australia.
Similarly, UFOs in the USA are sighted for around 100x longer than in Canada!
----XX-------
Let's do a scatterplot of the shapes by country:
The 'light' shape is the most commonly sighted UFO, with more than double the number of sightings of the next most popular shape, 'triangle'.
df_us_shape_count = df[df['country']=='us']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_us_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in USA by number of sightings')
Let's see if the distribution of shapes is similar for other countries as well.
Canada
df_ca_shape_count = df[df['country']=='ca']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_ca_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in Canada by number of sightings')
Australia
df_au_shape_count = df[df['country']=='au']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_au_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in Australia by number of sightings')
Great Britain
df_gb_shape_count = df[df['country']=='gb']['shape'].value_counts().reset_index().rename(columns={'index':'shape','shape':'count'}) #df of shapes according to count
px.scatter(df_gb_shape_count,x='shape',y='count',title='Distribution of shapes of UFO in Great Britain by number of sightings')
Looking at the above 4 plots, the difference in numbers is apparent.
However, 'light' remains the most commonly sighted shape in all 4 countries. In both USA and Canada, light is sighted more than double the number of times than the second most sighted shape.
The difference is smaller in Britain and Australia, but there is also less data for these countries; they can be expected to follow a similar trend given more data.
In terms of common shapes - light, circle and triangle are the most commonly sighted shapes, again something that holds true for all countries.
---XX--
A choropleth for the USA using the state codes given in the dataframe:
df['state_upper'] = df['state'].str.upper() #converting state codes to upper case
df['latitude'] = df['latitude'].str.replace('q','')
df['latitude']=df['latitude'].astype('float64')
#converting columns to float 64 and removing any miscellaneous strings present in the numbers
df_duration=df[['state_upper','duration (minutes)','latitude','longitude']].groupby(['state_upper']).mean().sort_values(by = 'duration (minutes)',ascending=False).reset_index()
#dataframe for mean duration of ufo sightings per states
px.choropleth(df_duration,color='duration (minutes)',locations='state_upper',locationmode='USA-states',color_continuous_scale='greens')
Scatterplot on map showing distribution of duration of sightings!
px.scatter_geo(df_duration,lat = 'latitude',lon='longitude',color='duration (minutes)')
--XX-- With only one numerical value on the map, I couldn't come up with correlations, etc.